1. INTRODUCTION

In urban environments, many public places are crowded most of the time. To ensure the safety of
people, security forces typically monitor such places with the help of automated crowd event recognition
systems equipped with closed-circuit television (CCTV). In such places, special attention must be given
to abnormal crowd activities such as sudden dispersion and merging. When people perceive a
threat, they panic and suddenly start dispersing from each other. If this event is not recognized and
controlled in time, it may lead to a chaotic situation that can claim many lives. On the
other hand, continuous merging of more and more people or groups at a certain place may lead to
congestion. If this event is not recognized and the flow of people is not redirected, it may result in a
stampede-like situation. Hence, timely recognition of such events is essential to avoid disasters.

Detecting dispersion or merging by manually watching CCTV footage of a place is a very
difficult task. Human operators may miss an important clue because of
fatigue caused by watching long video footage (which may run for several hours). To make this
process less cumbersome and more accurate, intelligent surveillance systems are being developed. In this
context, this paper proposes an automated method to detect dispersion and merging events of a crowd using
CCTV footage.

In the computer vision community, different methods have been proposed for crowd scene analysis and
event recognition. Detailed reviews of these methods can be found in [1]-[4]. In most of the traditional
methods [5]-[16], event recognition is performed by analyzing crowd motion. These methods use
different handcrafted features, such as motion magnitude and direction [5], [8], [10], [14], [15], [16],
motion vector intersections [6], optical flow manifolds [9], and dynamic textures [11], to characterize
crowd behavior. Classification of events is carried out either in a supervised way [5], [8], [14], [15] or
in an unsupervised way [7], [9]-[13].

Over the past few decades, researchers have widely used deep networks for various applications
[17]-[39]. Al Mamun et al. [30] and Ahmed et al. [36] proposed deep-network-based methods to improve the
accuracy of object tracking under varying illumination conditions. End-to-end deep architectures have been
proposed for anomaly detection and localization in crowded scenes [17]-[19], [28], [29]. In these
methods, feature extraction as well as classification is carried out by deep networks. These networks are
trained on large amounts of data in the form of spatio-temporal volumes extracted from video datasets. On the
other hand, methods like [20], [37] use pre-trained deep networks to extract features directly from raw
video frames. Event classification is then accomplished with the help of traditional classifiers such as support
vector machines (SVM). The method proposed by Elmannai and Al-Garni [35] is based on fusion of handcrafted
features and features extracted by a deep network. The authors use traditional machine learning
classifiers and a majority voting approach for classification. The methods proposed in [21]-[24] do not extract
features directly from raw video frames; instead, they first extract motion information using optical flow or 3D
gradients of images and then use deep neural networks to extract high-level features from this primary
motion information. In these methods, event classification is done either with supervised classifiers (such as
support vector machines or convolutional neural networks (CNN)) or in an unsupervised way, for example with
autoencoders/decoders. A few researchers, such as [26], have proposed fusion of fully convolutional neural
network (FCNN) features and optical flow features for detection of abnormal events. FCNN features are clustered
into a set of binary codes. The variations in the histogram of binary codes are compared with a statistical
measure to detect abnormality.

Variations in illumination, poor resolution, small object size and overlapping of objects are the major
issues in crowd activity recognition. Many researchers have proposed different handcrafted features, but
extracting accurate and reliable features is challenging. Inaccurate features may degrade the accuracy of
classification. Deep neural networks are able to extract features automatically. However, for
deep networks to extract appropriate features, they must first be trained on large training data. For crowd
abnormal event recognition systems, obtaining a large dataset is a major challenge. Even though many videos of
crowded scenes are available, very few of them contain sudden dispersion and merging activities.
Moreover, in the available videos, the number of frames corresponding to dispersion or merging is very small. This
is expected because such abnormal situations do not occur frequently. Thus, ‘class imbalance’ is inevitable,
and it adversely affects the performance of event recognition systems.

We address the above-mentioned issues in the proposed method. In this method, event recognition is
accomplished using fusion of deep features with a novel set of handcrafted features. The proposed
handcrafted features are not affected by occlusion and illumination. Moreover, these features are derived
using human intuition/intelligence to provide additional spatial and temporal information that can
complement the deep features. In this way, we combine the advantages of the modern deep network approach
and handcrafted features in order to improve the performance of crowd abnormal event recognition systems,
especially for small datasets having very few abnormal event instances.

Our approach to abnormal event recognition differs considerably from the methods mentioned
above. In the proposed method, convolutional neural networks are employed to extract spatial
information from each frame. Additionally, we propose novel handcrafted features to capture temporal
variations in attributes of the crowd such as motion magnitude and density. The features extracted by the CNN
are fused with the handcrafted features so that both spatial and temporal features are available for event
recognition (detailed explanation of the method in Section 2.3). The salient features of the proposed method
are:

- Two sets of features are proposed in this work for abnormal event recognition: a set of deep features
and a set of handcrafted features.
- Deep features are extracted by a convolutional neural network, GoogLeNet, using a transfer learning approach.
- We propose a completely new set of handcrafted features which represent spatial as well as temporal
variations in crowd motion patterns. The proposed features are computed without tracking the objects.
Also, they are not affected by occlusion and illumination and can be used for low to high crowd density
levels.
- The fusion of deep and handcrafted features integrates the advantages of both approaches and thus leads
to a more discriminative feature vector.

The rest of the paper is arranged as follows. The proposed method is discussed in detail in Section 2, a detailed
explanation of the datasets used for experimentation is given in Section 3, and an analysis of results
on different datasets is presented in Section 4. Conclusion and future work are discussed in Section 5.


2. PROPOSED METHOD 
2.1. Background 

In the proposed method, the focus is on detection of sudden dispersion and merging events of people
at crowded places. As the aim of the proposed work is to distinguish merging and sudden dispersion events from
the normal situation, it is necessary to understand the definitions of merging, sudden dispersion and normal
situation (Figure 1) before the proposed method is explained in detail.

- Normal Situation: A situation is said to be normal when there is no significant movement (refer to
Figure 1(a): Sequence 1.1), when people are walking coherently (as in Figure 1(b): Sequence 1.2) or when they
are freely roaming (refer to Figure 1(c): Sequence 1.3). Thus, in a normal situation, the average speed of motion
and the density of people generally remain constant w.r.t. time.

- Merging Event: When people continuously gather at a certain location/region, we call it a merging event
(refer to Figure 2, Figure 2(a): Sequence 2.1 and Figure 2(b): Sequence 2.2). In this situation, the number of
people in that region (i.e. density) keeps increasing w.r.t. time and may even increase beyond the
capacity of that place.

- Sudden Dispersion Event: When people (who were initially moving comfortably) suddenly run away from
each other, we define it as a sudden dispersion event (refer to Figure 3). In this situation, a significant increase in
motion magnitude and a reduction in crowd density are observed.




Figure 1. Examples of normal situation, (a) sequence 1.1: People standing in a group (no significant motion), 
(b) sequence 1.2: People walking coherently, and (c) sequence 1.3: People freely roaming 


2.2. Block diagram 

The block diagram of the proposed method is shown in Figure 4. The video stream captured from
CCTV is processed frame by frame. Every frame is processed simultaneously through the pipelines of deep
feature extraction and handcrafted feature extraction. These features are then combined to form the final
feature vector, which is then used for event classification. The proposed method is explained in detail in
Section 2.3.


2.3. Deep feature extraction 


Deep features are extracted from raw frames using a pre-trained CNN. We have used the standard 22-layer
architecture of GoogLeNet [40] with 9 inception modules, which are used for feature dimensionality reduction
without reducing the performance gain. It accepts input images of size 224 × 224 × 3, so all the frames are
resized to this size. The frame passes through multiple sets of convolutional and pooling layers and inception
modules. The activation function used is the rectified linear unit (ReLU), which helps to
overcome the problem of vanishing gradients. The initial 10 layers of the network are used as they are, i.e. their
weights are not modified. However, GoogLeNet was originally trained on the ImageNet dataset for the object
recognition task, so it must learn features specific to the crowd activity recognition task.
Hence, the remaining 12 layers (present in the 9 inception modules) are fine-tuned by retraining them on crowd activity
recognition datasets. The final feature map is acquired from the last average pooling layer and is a 1024 × 1
dimensional vector per frame. This feature map gives global/high-level information about the input image.

As compared to other CNN architectures (such as LeNet, AlexNet, and VGGNet), the
pre-trained GoogLeNet model requires less memory [40]. It uses auxiliary classifiers to overcome the
problem of overfitting which is typically experienced by very deep networks [40]. Moreover, it uses a global
average pooling layer at the end instead of a fully connected layer, due to which the total number of learnable
parameters is significantly reduced and accuracy is improved. Hence, GoogLeNet is used in the proposed method.




Figure 3. Examples of dispersion situation 


2.4. Handcrafted features 

For every incoming frame, a set of novel handcrafted features is extracted which provides spatial
and temporal information about the crowd. These features are extracted using a macroscopic approach to crowd
density and flow estimation. Detection and tracking of individual objects is completely avoided in the
proposed method.

We propose the following five features for event recognition: average distance from centroid of crowd,
rate of change of average distance from centroid, rate of change of average density, average magnitude of
motion and rate of change of magnitude of motion. Of these five features, average distance from centroid
of crowd, rate of change of average distance from centroid and rate of change of average crowd density are
spatial features. These features are computed from foreground pixels and they represent the distance between
people and the density of people. The remaining two are motion features which represent the average speed of people.
The process of ‘handcrafted feature extraction’ is described in this section.


Recognition of crowd abnormal activities using fusion of handcrafted and ... (Manasi Pathade) 


1080 O ISSN: 2502-4752 


[Figure 4 diagram content. Feature extraction using GoogLeNet: Convolution layer 1 (64 filters, 7×7, stride 2)
-> 112×112×64 -> max pooling (3×3, stride 2) -> 56×56×64 -> Convolution layer 2 (192 filters, 3×3, stride 1)
-> 56×56×192 -> max pooling (3×3, stride 2) -> 28×28×192 -> inception modules (2) -> 28×28×480
-> max pooling (3×3, stride 2) -> 14×14×480 -> inception modules (5) -> 14×14×832 -> max pooling (3×3, stride 2)
-> 7×7×832 -> inception modules (2) -> 7×7×1024 -> average pooling (7×7, stride 1) -> 1×1×1024 -> reshape -> 1×1024.
Handcrafted feature extraction: motion estimation, density-based and motion feature computation -> handcrafted
feature vector. Both outputs -> feature fusion (concatenation of handcrafted and deep features) -> dropout
-> softmax -> activity classification (merging, normal situation, dispersion). Inset, inception module:
convolution 1×1, convolution 3×3, convolution 5×5 and max pooling 1×1 (all stride 1) with filter concatenation.]


Figure 4. Block diagram of proposed method 


2.4.1. Spatial feature extraction 
Spatial features are extracted in two steps: foreground detection followed by feature computation.

- Foreground detection: In this step, moving objects are separated from the background. In our application, we
need a constant model of the background so that the correct foreground can be detected even if objects
remain stationary for a long time. Considering this, we use the background subtraction
approach. It is the simplest and fastest technique and does not need any background model to be
predefined. A frame without any moving objects is taken as the background frame. The
foreground is obtained using (1) and (2).


$$\text{imdiff} = \left| I_{current} - I_{background} \right| \tag{1}$$

$$I_{foreground} = \begin{cases} 1, & \text{imdiff} > \text{threshold} \\ 0, & \text{otherwise} \end{cases} \tag{2}$$

The threshold is computed using Otsu’s algorithm [41]. In order to reduce false positive
pixels, median filtering is applied to the thresholded image. The foreground image for a sample frame is shown in
Figure 5.


Figure 5. Foreground detection of sample frame of our new dataset 
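The foreground detection step in (1) and (2) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the MATLAB implementation used in this work: frames are small grayscale images stored as nested lists of integers in 0..255, `otsu_threshold` and `detect_foreground` are hypothetical helper names, and the median filtering step is omitted for brevity.

```python
def otsu_threshold(values):
    """Return the Otsu threshold for a flat list of integers in 0..255.

    Picks the level t that maximizes the between-class variance of the
    two groups {v <= t} and {v > t}.
    """
    hist = [0] * 256
    for v in values:
        hist[v] += 1
    total = len(values)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w0 = 0      # number of pixels at or below the candidate level
    sum0 = 0    # intensity sum of those pixels
    for t in range(256):
        w0 += hist[t]
        sum0 += t * hist[t]
        w1 = total - w0
        if w0 == 0 or w1 == 0:
            continue
        mean0 = sum0 / w0
        mean1 = (sum_all - sum0) / w1
        between_var = w0 * w1 * (mean0 - mean1) ** 2
        if between_var > best_var:
            best_t, best_var = t, between_var
    return best_t


def detect_foreground(current, background):
    """Binary foreground mask following (1) and (2)."""
    # (1): absolute difference between the current and background frames
    imdiff = [[abs(c - b) for c, b in zip(row_c, row_b)]
              for row_c, row_b in zip(current, background)]
    threshold = otsu_threshold([v for row in imdiff for v in row])
    # (2): 1 where the difference exceeds the threshold, 0 otherwise
    return [[1 if v > threshold else 0 for v in row] for row in imdiff]


# Toy 3x4 frames (illustrative values only)
background = [[10, 10, 10, 10],
              [10, 10, 10, 10],
              [10, 10, 10, 10]]
current = [[10, 12, 200, 210],
           [11, 10, 205, 10],
           [10, 10, 10, 10]]
mask = detect_foreground(current, background)
```

In this toy example the small differences caused by noise (1 or 2 gray levels) fall below the Otsu threshold, so only the three strongly changed pixels are marked as foreground.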


- Spatial feature computation: Using the foreground pixels extracted in the previous step, various spatial
features are computed. These features describe the distribution of moving objects in the scene, their
density and the distance between them. The following features are computed:

1. Average distance of foreground points from centroid of crowd: This feature describes the overall
spread of people w.r.t. the centroid. The centroid of the foreground region describes the center of mass,
i.e. the mean location around which the moving objects are distributed. It is observed that in a
merging situation the average distance of people from the centroid keeps reducing; in a dispersion
situation it increases, while in a normal situation it does not change significantly. The centroid of the
foreground region is calculated as (3).

$$C(x, y) = \frac{\sum_{i=1}^{n} P_i(x, y)}{n} \tag{3}$$

where $n$ is the total number of foreground pixels and $P_i(x, y)$ is the spatial location of the $i$-th foreground
pixel. The average distance from the centroid is calculated using (4) and (5):

$$d(P_i) = \sqrt{\left(C(x) - P_i(x)\right)^2 + \left(C(y) - P_i(y)\right)^2} \tag{4}$$

$$d_{avg} = \frac{\sum_{i=1}^{n} d(P_i)}{n} \tag{5}$$

2. Rate of change of average distance: The variation in average distance from the centroid is shown in Figure 6
for the PETS dataset. We see that when people are merging, the average distance reduces; when
people are just standing in a group (i.e. normal situation), the average distance is constant and at its
minimum level; while in a dispersion situation, the distance from the centroid keeps increasing.
From Figure 6, it can be understood that monitoring the change in the average distance w.r.t. time is
important to detect the type of event (i.e. merging or dispersion). Hence, we consider the rate
of change of average distance w.r.t. time (computed over a window of 25 frames) as one of the
distinctive features.

3. Rate of change of crowd density: This feature is estimated in three steps:

a. Estimation of absolute density around centroid: This feature estimates the number of people
within a certain distance of the centroid. First, the number of foreground points situated in a circular
region with radius ‘r’ around the centroid ‘C’ is counted. This is defined as ‘absolute density’
(refer to (6)).

$$\text{Absolute density} = \sum_{i=1}^{n} \mathbf{1}\left(d_i < r\right) \tag{6}$$

Here $d_i$ is the distance of the $i$-th foreground pixel from the centroid, as in (4), and $\mathbf{1}(\cdot)$
equals 1 when its condition holds and 0 otherwise. In order to make the method scene-independent, ‘r’ is not
kept constant; instead it is assigned the value of $d_{avg}$ (computed using (5)). Depending on the distribution
of people in the scene, $d_{avg}$ changes (as demonstrated in Figure 6). In a dispersion situation the average
distance keeps increasing, while in a merging situation it reduces.

b. Estimation of average density around centroid: To estimate the crowdedness around the
centroid, average density is computed using (7).

$$\text{Average density} = \frac{\text{Absolute density}}{d_{avg}} \tag{7}$$



c. Estimation of rate of change of average density: The variation in average density is shown in Figure 7
for the PETS dataset. It can be seen that the density in the region around the centroid keeps
increasing in a merging situation. In a normal situation, the density does not change much;
however, when people start dispersing from each other, the density keeps reducing. Thus, we
consider the ‘rate of change of average density computed over a time period of 25 frames’ as one of
the features for event classification.


[Plot title: Average Distance From Centroid of Crowd]

[Plot title: Density of People around Centroid of Crowd]


Figure 7. Variation in average density from centroid w.r.t time 
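The per-frame spatial quantities in (3) to (7) can be sketched as follows. This is an illustrative sketch rather than the implementation used in this work: the function name `spatial_features` and the nested-list mask format are hypothetical, and the normalization of the average density by $d_{avg}$ is an assumption about the form of (7).

```python
import math


def spatial_features(mask):
    """Per-frame spatial features from a binary foreground mask.

    Returns (centroid, d_avg, absolute_density, average_density).
    """
    points = [(x, y)
              for y, row in enumerate(mask)
              for x, value in enumerate(row) if value]
    n = len(points)
    if n == 0:
        return None, 0.0, 0, 0.0

    # (3): the centroid is the mean location of the foreground pixels
    cx = sum(x for x, _ in points) / n
    cy = sum(y for _, y in points) / n

    # (4) and (5): distance of each pixel from the centroid, then the mean
    distances = [math.hypot(cx - x, cy - y) for x, y in points]
    d_avg = sum(distances) / n

    # (6): count the pixels closer to the centroid than r, with r = d_avg
    absolute_density = sum(1 for d in distances if d < d_avg)

    # (7): normalization by d_avg (assumed form of the equation)
    average_density = absolute_density / d_avg if d_avg > 0 else 0.0
    return (cx, cy), d_avg, absolute_density, average_density


# Toy 5x5 mask: four corner pixels and the center pixel are foreground
mask = [[1, 0, 0, 0, 1],
        [0, 0, 0, 0, 0],
        [0, 0, 1, 0, 0],
        [0, 0, 0, 0, 0],
        [1, 0, 0, 0, 1]]
centroid, d_avg, abs_density, avg_density = spatial_features(mask)
```

Here the centroid is the center pixel, the four corners lie farther away than the average distance, and only the center pixel is counted in the absolute density. Because the radius follows $d_{avg}$, the counting region grows and shrinks with the spread of the crowd instead of being fixed per scene.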


2.4.2. Motion feature extraction 

In normal situations, people move leisurely at a comfortable speed. When they perceive
a threat, they suddenly run away from each other. Thus, the type of flow is completely different in a
normal situation and in sudden dispersion. Hence, in addition to density-based spatial features, motion-based
temporal features are also extracted. There are two steps in this process: global motion estimation and feature
computation.
- Global motion estimation: It is accomplished using the Horn and Schunck optical flow algorithm [42], [43].
Only the flow vectors at foreground pixels are considered in the subsequent feature extraction step.
- Computation of motion-based features: After acquiring the flow vectors, the following two features are
computed:
1. Average magnitude of motion is computed using (8).

$$V_{avg} = \frac{\sum_{i=1}^{n} V_i}{n} \tag{8}$$

The velocity magnitude $V_i$ of each flow vector is calculated using (9).

$$V_i = \sqrt{V_{xi}^2 + V_{yi}^2} \tag{9}$$

$V_{xi}$ and $V_{yi}$ are the horizontal and vertical components of the flow vectors.

2. The rate of change of magnitude of motion is then computed over a window of 25 frames. The variation
in average velocity magnitude is shown in Figure 8 for the PETS dataset. The magnitude increases in
the initial part, where people are entering the field of view, but after some time it starts reducing as
people gather at some place. In the normal condition, when there is no significant motion, the magnitude
is almost constant and at its minimum; on the other hand, when sudden dispersion occurs, the magnitude
starts increasing rapidly. The rate of increase in magnitude is clearly visible in the last part of the curve.
Similar variations in the features are observed on the UMN dataset and our datasets. The other attribute of
motion, i.e. the direction, does not show any such discriminative nature (it shows a similar distribution
in merging and dispersion situations) and hence it is not considered in the proposed method.
Figure 8. Variation in average magnitude of velocity w.r.t. time
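The motion features in (8) and (9) and the windowed rate of change can be sketched as below. The sketch assumes that the flow vectors at foreground pixels are already available as `(vx, vy)` pairs. The text does not state the exact formula for the rate of change over 25 frames, so the finite difference used in `rate_of_change` is an assumption, and both function names are hypothetical.

```python
import math


def average_motion_magnitude(flow_vectors):
    """(8) and (9): mean magnitude of (vx, vy) flow vectors at foreground pixels."""
    if not flow_vectors:
        return 0.0
    magnitudes = [math.hypot(vx, vy) for vx, vy in flow_vectors]
    return sum(magnitudes) / len(magnitudes)


def rate_of_change(series, window=25):
    """Windowed rate of change of a per-frame feature.

    Assumption: the rate at frame t is (f[t] - f[t - window]) / window;
    frames without a full window of history get 0.0.
    """
    return [0.0 if t < window else (series[t] - series[t - window]) / window
            for t in range(len(series))]


# Magnitudes of these three vectors are 5, 0 and 10
v_avg = average_motion_magnitude([(3.0, 4.0), (0.0, 0.0), (6.0, 8.0)])

# A feature that stays flat for 30 frames and then grows by 0.5 per frame
series = [1.0] * 30 + [1.0 + 0.5 * k for k in range(1, 31)]
rates = rate_of_change(series)
```

The same `rate_of_change` helper applies equally to the average distance from the centroid and to the average density, since all three rate features use a 25-frame window.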


2.5. Fusion of deep features with handcrafted features 

The handcrafted features are fused with the deep features. The five-dimensional handcrafted feature
vector is concatenated with the 1024-dimensional feature vector obtained at the output of the last average pooling
layer of the pre-trained GoogLeNet. Thus, the final feature vector is of dimension 1029 × 1. This feature vector is
then used for recognizing the event happening in the scene.
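The fusion step amounts to a simple concatenation, which can be sketched as follows. The helper name `fuse_features`, the ordering (deep features first) and the numeric values are assumptions made only for illustration.

```python
def fuse_features(deep_features, handcrafted_features):
    """Concatenate the deep and handcrafted vectors into one feature vector."""
    if len(deep_features) != 1024 or len(handcrafted_features) != 5:
        raise ValueError("expected 1024 deep and 5 handcrafted features")
    # Ordering (deep first, handcrafted last) is an assumption
    return list(deep_features) + list(handcrafted_features)


deep = [0.0] * 1024                           # placeholder for the pooled CNN output
handcrafted = [2.26, -0.01, 0.02, 5.0, 0.5]   # illustrative values only
fused = fuse_features(deep, handcrafted)
```

The resulting vector has 1024 + 5 = 1029 entries, matching the dimension stated above.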


2.6. Event classification 

The final feature vector is passed through a dropout layer before the actual classification. This is required to
avoid the overfitting problem, which is likely to occur when training datasets are small. During the training phase,
the dropout layer temporarily removes some nodes in the network based on a probabilistic threshold. This
prevents the nodes from memorizing the training data, which in turn helps to prevent overfitting. The final
classification is done by the softmax layer of GoogLeNet. GoogLeNet was originally trained on the ImageNet
dataset for the object recognition task. Hence, in the first phase, the softmax layer is trained using the training set of
feature vectors $S = \{s_1, s_2, \ldots, s_N\}$ along with the corresponding labels $L = \{l_1, l_2, \ldots, l_N\}$ so that it
learns features specific to crowd activity recognition. Here, $N$ is the total number of training samples,
$s_i \in \mathbb{R}^d$ is a feature vector of dimension $d$ ($d = 1029$) and
$l_i \in \{\text{normal}, \text{merging}, \text{sudden dispersion}\}$. In the second phase,
predictions are made on unseen test samples $t_i \in \mathbb{R}^d$. Hyper-parameters are also fine-tuned in
order to reduce the classification loss. The hyper-parameters (such as learning rate, batch size and number of
epochs) are selected empirically. We select a mini-batch size of 10, a learning rate of 0.0001 (which is kept
constant for all iterations) and 6 epochs.
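The roles of the dropout and softmax layers can be illustrated with a small standalone sketch. It is not the network code used in this work: the dropout rate of 0.5, the class scores and the function names are assumptions chosen for illustration, and the inverted-dropout rescaling shown here is one common way of implementing dropout.

```python
import math
import random


def dropout(vector, rate, rng):
    """Inverted dropout for training: zero each entry with probability `rate`
    and rescale the survivors so the expected value is unchanged."""
    keep = 1.0 - rate
    return [v / keep if rng.random() < keep else 0.0 for v in vector]


def softmax(scores):
    """Numerically stable softmax over a list of class scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


CLASSES = ["normal", "merging", "sudden dispersion"]
scores = [1.0, 3.0, 0.5]              # illustrative class scores
probs = softmax(scores)
predicted = CLASSES[probs.index(max(probs))]

rng = random.Random(0)                # fixed seed so the sketch is repeatable
dropped = dropout([1.0] * 8, rate=0.5, rng=rng)   # rate is an assumed value
```

Softmax turns the three class scores into probabilities that sum to one, and the class with the largest probability is reported as the recognized event. Dropout is applied only during training; at test time the feature vector is passed through unchanged.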


3. RESEARCH METHOD 

The proposed method is implemented on the Windows platform with an Intel Core 3 processor, 8 GB RAM
and MATLAB 2019b. The method is evaluated on public datasets as well as private datasets captured by us.
Detailed information about these datasets is given below.


3.1. Publicly available datasets

For crowded scene analysis, various datasets are available, such as the University of California San
Diego (UCSD) Pedestrian Dataset (PED1 and PED2), Avenue, the University of Minnesota (UMN) dataset,
Performance Evaluation of Tracking and Surveillance (PETS), and the Chinese University of Hong Kong
(CUHK) dataset. Of these, merging and sudden dispersion situations are found only in the UMN and
PETS datasets. Hence, we considered the PETS and UMN datasets for evaluating our method.

The PETS 2009 (S3) dataset [44] consists of four videos of an outdoor scene, each recorded from four
different viewpoints. Each scene is approximately 1 min in duration. There are 377 frames in total in each
video. The frame rate is 7 fps and the resolution is 576 × 768. The UMN dataset consists of three scenes captured in
different environments. Of these, two are outdoor scenes and one is an indoor scene. The total duration of
the video is 4 min 17 sec. The frame rate is 23 fps and the resolution is 240 × 320. There is wide variation in
illumination conditions and in the distance of objects from the camera in these scenes.


3.2. New dataset 

A new dataset is created by acquiring CCTV footage at the Cummins College campus. It consists of two
video clips which demonstrate both merging and dispersion situations. These are outdoor scenes captured
from the same viewpoint but on different days and under different illumination conditions. The first video clip
consists of 1100 frames and the second clip consists of 1133 frames. The videos are captured at a frame rate of
25 fps with a resolution of 1080 × 1920. Again, every frame is observed manually and the event in the frame
is labeled as ‘merging event’ or ‘sudden dispersion event’. Table 1 summarizes each dataset in
brief. It can be observed from Table 1 that the datasets are unbalanced: frames depicting sudden dispersion
(and, in the new dataset, merging) are far fewer than normal frames.


Table 1. Information of different datasets used for performance evaluation 


Dataset               # Total Frames   # Sudden Dispersion Frames   # Merging Frames   # Normal Frames
PETS 14to33           377              42                           190                145
UMN Scene 1           1525             383                          --                 1142
UMN Scene 2           3728             1158                         --                 2570
UMN Scene 3           2070             239                          --                 1831
New dataset Scene 1   1100             54                           263                783
New dataset Scene 2   1133             119                          211                803
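The degree of class imbalance can be read off Table 1 with a short calculation. The sketch below copies the frame counts from Table 1 and computes the percentage share of each class; treating the "--" entries as zero merging frames is an assumption, and `class_shares` is a hypothetical helper name.

```python
# (dispersion, merging, normal) frame counts copied from Table 1;
# "--" entries are written as 0 (assumption: no merging frames labeled)
TABLE_1 = {
    "PETS 14to33": (42, 190, 145),
    "UMN Scene 1": (383, 0, 1142),
    "UMN Scene 2": (1158, 0, 2570),
    "UMN Scene 3": (239, 0, 1831),
    "New dataset Scene 1": (54, 263, 783),
    "New dataset Scene 2": (119, 211, 803),
}


def class_shares(counts):
    """Percentage of frames per class, rounded to one decimal place."""
    total = sum(counts)
    return tuple(round(100.0 * c / total, 1) for c in counts)


shares = {name: class_shares(counts) for name, counts in TABLE_1.items()}
dispersion_share_new_scene_1 = shares["New dataset Scene 1"][0]
```

For example, sudden dispersion accounts for 54 of the 1100 frames in Scene 1 of the new dataset, which is under 5% of that clip.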


4. RESULTS AND DISCUSSION 
4.1. Evaluation parameters 

The performance of our method is quantified in terms of accuracy, precision, recall and F-measure.
These parameters are calculated using (10) to (13).


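$$\%\,\text{Accuracy} = \frac{TP + TN}{\text{Total Frames}} \times 100 \tag{10}$$

$$\%\,\text{Precision} = \frac{TP}{TP + FP} \times 100 \tag{11}$$

$$\%\,\text{Recall} = \frac{TP}{TP + FN} \times 100 \tag{12}$$

$$\%\,\text{F-measure} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \times 100 \tag{13}$$

Here TP, TN, FP and FN denote the numbers of true positive, true negative, false positive and false negative
frames for the class under consideration. In (13), precision and recall are taken as fractions rather than percentages.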
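The per-class evaluation in (10) to (13) can be sketched as a one-vs-rest computation over frame labels. The function name `per_class_metrics` and the toy label lists are hypothetical and are not taken from the experiments. Because precision and recall are already expressed in percent inside the function, the F-measure needs no further multiplication by 100.

```python
def per_class_metrics(true_labels, predicted_labels, positive):
    """One-vs-rest metrics for the class `positive`, all in percent."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)

    accuracy = 100.0 * (tp + tn) / len(pairs)                 # (10)
    precision = 100.0 * tp / (tp + fp) if tp + fp else 0.0    # (11)
    recall = 100.0 * tp / (tp + fn) if tp + fn else 0.0       # (12)
    # (13): harmonic mean of precision and recall (inputs already in percent)
    f_measure = (2 * precision * recall / (precision + recall)
                 if precision + recall else 0.0)
    return accuracy, precision, recall, f_measure


# Toy example: 6 normal frames and 4 dispersion frames, one dispersion frame missed
truth = ["normal"] * 6 + ["dispersion"] * 4
preds = ["normal"] * 6 + ["dispersion"] * 3 + ["normal"]
acc, prec, rec, f1 = per_class_metrics(truth, preds, "dispersion")
```

In this toy case one missed dispersion frame leaves precision at 100% but lowers recall to 75%, which is the kind of gap that the F-measure is meant to expose on imbalanced data.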
4.2. Experimental results

In this section, we present the results of the experiments carried out to test the effectiveness of
the proposed method. It is important to note that, to test the efficacy of the method on small datasets
with large class imbalance, we tested it without any data augmentation or class balancing technique.
The classifier is trained on a subset of frames of each dataset and tested on the remaining frames of the same dataset.
The ratio of training to testing frames is kept at 60:40.

We compare the performance of abnormal event recognition using only deep neural networks with
that of the proposed feature fusion method on different datasets (refer to Table 2). The performance parameters are
calculated for each class separately. The F-measure and accuracy values generally improve
for dispersion and merging activities with the proposed feature fusion approach; for example, on PETS the
F-measure for dispersion rises from 86.68 to 90.35 and the accuracy from 97.2 to 97.9. This
indicates the effectiveness of the proposed handcrafted features in recognizing events on datasets with high
class imbalance.


Table 2. Results of abnormal event recognition using only CNN features and Proposed Method: CNN and 
handcrafted feature fusion 


Dataset  Method           Precision                    Recall                       F-measure                    Accuracy
                          Dispersion  Normal  Merge    Dispersion  Normal  Merge    Dispersion  Normal  Merge
PETS     Only CNN         100         93.5    100      76.5        100     100      86.68       96.64   100      97.2
PETS     Proposed Method  100         95.1    100      82.4        100     100      90.35       97.48   100      97.9
MVI7     Only CNN         100         99.3    92.9     84          100     96.3     91.30       99.64   94.56    98.43
MVI7     Proposed Method  100         98.7    100      100         100     92.6     100         99.34   96.16    99
MVI11    Only CNN         97.9        100     91.7     100         96.8    100      98.93       98.37   95.67    97.9
MVI11    Proposed Method  100         100     100      100         100     100      100         100     100      100
UMN      Only CNN         99.2        99.68   -        98.54       99.92   -        98.86       99.79   -        99.68
UMN      Proposed Method  100         99.9    -        99.38       100     -        99.68       99.94   -        99.92


4.3. Comparison with other state of the art methods 

The proposed method is compared with other methods w.r.t. accuracy, F-measure and area under the
receiver operating characteristic curve (AUC) on the standard, publicly available UMN dataset. The comparison is
presented in Table 3. It can be seen that the accuracy, F-measure and AUC of the proposed method are better
than those of the other methods. This means that, as compared to other methods, the proposed method can better
distinguish between normal and abnormal situations. Thus, the proposed method achieves the best
performance among the compared methods.


Table 3. Comparison of proposed method with other methods on UMN dataset 


Method                    Features used                    Training:Testing ratio   Accuracy   F-measure   AUC
Proposed                  Handcrafted + Deep               60:40                    99.92      99.68       99.94
Abdullah et al. [14]      Handcrafted                      --                       87         NA*         NA*
Wu et al. [7]             Handcrafted                      Not specified            99.03      NA*         NA*
Guo et al. [12]           Handcrafted                      --                       NA*        NA*         NA*
Fang et al. [21]          Extracted by deep network        Not specified            NA*        98.81       NA
Smeureanu et al. [19]     Extracted by deep network        80:20                    NA*        NA*         97.1
Zhou et al. [16]          Extracted by deep network        Not specified            NA*        NA*         99.63
Ravanbaksh et al. [25]    Feature Fusion (OF + AlexNet)    --                       98.8       NA          98.8
Direkglu [24]             MI + Deep                        Not specified            99.08      NA          NA


5. CONCLUSION 

In this paper, we have proposed a method for recognizing two unusual events, sudden dispersion and
merging, in crowded scenes. The proposed method combines the benefits of deep networks and handcrafted
features for abnormal event recognition. In the proposed method, we have employed a convolutional neural
network to extract higher-level features from video frames. In addition, we propose a novel set of low-level
features to capture temporal variations in crowd density and motion patterns. The feature vector generated by
fusion of the handcrafted features and the features extracted by the CNN is used for event recognition. The method is
validated on benchmark datasets as well as our private datasets. Experimental results show that our method
successfully recognized abnormal events from small datasets having very few abnormal event instances. The
results obtained by fusion of deep and handcrafted features are more accurate than those obtained with only deep features.
This indicates that the proposed set of handcrafted features provides additional information. Thus, the handcrafted
features complement the features extracted by the CNN, and their fusion improves the performance of event
recognition. The method is also shown to be more effective than other state-of-the-art
methods on the UMN dataset.